Fast Bit-Reversals on Uniprocessors and Shared-Memory Multiprocessors
Authors
Abstract
In this paper, we examine different methods using techniques of blocking, buffering, and padding for efficient implementations of bit-reversals. We evaluate the merits and limits of each technique and its application and architecture-dependent conditions for developing cache-optimal methods. Besides testing the methods on different uniprocessors, we conducted both simulation and measurements on two commercial symmetric multiprocessors (SMP) to provide architectural insights into the methods and their implementations. We present two contributions in this paper: (1) Our integrated blocking methods, which match cache associativity and translation-lookaside buffer (TLB) cache size and which fully use the available registers, are cache-optimal and fast. (2) We show that our padding methods outperform other software-oriented methods, and we believe they are the fastest in terms of minimizing both CPU and memory access cycles. Since the padding methods are almost independent of hardware, they could be widely used on many uniprocessor workstations and multiprocessors.
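As a rough illustration of the blocking idea mentioned in the abstract, the sketch below compares a straightforward bit-reversal permutation with a tiled variant that splits each index into high, middle, and low bit fields and processes one small tile at a time. It is not the authors' implementation: the function names, the `block_bits` parameter, and the tiling scheme are assumptions made for this example, and the buffering and padding techniques are not shown. Python will not expose cache or TLB effects, so the sketch only demonstrates that the tiled index arithmetic produces the same permutation.

```python
def reverse_bits(value, width):
    """Return `value` with its lowest `width` bits in reversed order."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def _log2_length(x):
    """Return log2(len(x)), or raise if the length is not a power of two."""
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    return n.bit_length() - 1


def bit_reversal_naive(x):
    """Reference version: y[rev(i)] = x[i] for every index i."""
    width = _log2_length(x)
    y = [None] * len(x)
    for i in range(len(x)):
        y[reverse_bits(i, width)] = x[i]
    return y


def bit_reversal_blocked(x, block_bits=2):
    """Tiled version (illustrative assumption, not the paper's code).

    An index is viewed as (a, m, c): `a` is the top `block_bits` bits,
    `c` the bottom `block_bits` bits, and `m` the bits in between.
    Its bit reversal is (rev(c), rev(m), rev(a)), so for a fixed `m`
    the work is a small 2^b x 2^b tile of reads and writes.
    """
    width = _log2_length(x)
    if width < 2 * block_bits:
        return bit_reversal_naive(x)  # too small to tile
    mid_bits = width - 2 * block_bits
    tile = 1 << block_bits
    high_shift = width - block_bits
    rev_small = [reverse_bits(i, block_bits) for i in range(tile)]
    y = [None] * len(x)
    for m in range(1 << mid_bits):
        m_rev = reverse_bits(m, mid_bits)
        for a in range(tile):
            for c in range(tile):
                src = (a << high_shift) | (m << block_bits) | c
                dst = (rev_small[c] << high_shift) | (m_rev << block_bits) | rev_small[a]
                y[dst] = x[src]
    return y
```

In a compiled language, `block_bits` would typically be chosen so that one tile fits the cache lines and TLB entries available; that tuning is the architecture-dependent part the abstract refers to.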
Similar resources
The Use of Caching in Decoupled Multiprocessors with Shared Memory
In the following we evaluate the costs and benefits of using a cache memory with a decoupled architecture supporting shared memory in both the uniprocessor and multiprocessor cases. Firstly we identify the performance bottleneck of such architectures, which we define as Loss of Decoupling costs. We show that in both uniprocessors and multiprocessor machines with high latency such costs can greatl...
RSIM: An Execution-Driven Simulator for ILP-Based Shared-Memory Multiprocessors and Uniprocessors
This paper describes RSIM, the Rice Simulator for ILP Multiprocessors, Version [...]. RSIM simulates shared-memory multiprocessors and uniprocessors built from processors that aggressively exploit instruction-level parallelism (ILP). RSIM is execution-driven and models state-of-the-art ILP processors, an aggressive memory system, and a multiprocessor coherence protocol and interconnect, including conten...
False Sharing and Spatial Locality in Multiprocessor Caches
The performance of the data cache in shared-memory multiprocessors has been shown to be different from that in uniprocessors. In particular, cache miss rates in multiprocessors do not show the sharp drop typical of uniprocessors when the size of the cache block increases. The resulting high cache miss rate is a cause of concern, since it can significantly limit the performance of multiprocessors....
Scalable Reader-Writer Synchronization for Shared-Memory Multiprocessors
Reader-writer synchronization relaxes the constraints of mutual exclusion to permit more than one process to inspect a shared object concurrently, as long as none of them changes its value. On uniprocessors, mutual exclusion and reader-writer locks are typically designed to de-schedule blocked processes; however, on shared-memory multiprocessors it is often advantageous to have processes busy wa...
Cache-Affinity Scheduling for Fine Grain Multithreading
Cache utilisation is often very poor in multithreaded applications, due to the loss of data access locality incurred by frequent context switching. This problem is compounded on shared memory multiprocessors when dynamic load balancing is introduced and thread migration disrupts cache content. In this paper, we present a technique, which we refer to as ‘batching’, for reducing the negative impa...
Journal title:
- SIAM J. Scientific Computing
Volume 22, Issue
Pages -
Publication date: 2001